feat: Add new component CSVDocumentSplitter to recursively split CSV documents #8815

sjrl · 2025-02-05T11:34:51Z

Related Issues

fixes Create a CSV Document splitter #8784

Proposed Changes:

Alternative approach as discussed in this PR: #8795
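
For orientation, the general idea of recursively splitting a CSV on empty rows and columns can be sketched with the standard library alone. This is an illustrative sketch of the technique, not the code added in this PR; `parse_grid`, `split_blocks`, and `threshold` are hypothetical names chosen for the example, and the sample data is taken from the test discussed in this conversation.

```python
import csv
from io import StringIO
from typing import List, Tuple

Grid = List[List[str]]


def parse_grid(text: str) -> Grid:
    """Parse CSV text into a rectangular grid, padding short rows with ''."""
    rows = list(csv.reader(StringIO(text)))  # blank lines come back as []
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]


def _segments(blank: List[bool], threshold: int) -> List[Tuple[int, int]]:
    """Return (start, end) ranges that remain after removing every run of
    at least `threshold` consecutive blank lines."""
    keep = [True] * len(blank)
    i = 0
    while i < len(blank):
        if blank[i]:
            j = i
            while j < len(blank) and blank[j]:
                j += 1
            if j - i >= threshold:
                keep[i:j] = [False] * (j - i)
            i = j
        else:
            i += 1
    segments, start = [], None
    for idx, kept in enumerate(keep + [False]):  # sentinel closes the last run
        if kept and start is None:
            start = idx
        elif not kept and start is not None:
            segments.append((start, idx))
            start = None
    return segments


def split_blocks(grid: Grid, threshold: int = 1) -> List[Grid]:
    """Recursively split on empty rows first, then on empty columns."""
    if not grid or not grid[0]:
        return []
    row_segs = _segments([all(c == "" for c in row) for row in grid], threshold)
    if row_segs != [(0, len(grid))]:
        return [b for s, e in row_segs for b in split_blocks(grid[s:e], threshold)]
    width = len(grid[0])
    col_blank = [all(row[c] == "" for row in grid) for c in range(width)]
    col_segs = _segments(col_blank, threshold)
    if col_segs != [(0, width)]:
        return [
            b
            for s, e in col_segs
            for b in split_blocks([row[s:e] for row in grid], threshold)
        ]
    return [grid]  # no empty rows or columns left: this is one table


csv_data = (
    "ID,LeftVal,,,RightVal,Extra\n"
    "1,Hello,,,World,Joined\n"
    "2,StillLeft,,,StillRight,Bridge\n"
    "\n"
    "A,B,,,C,D\n"
    "E,F,,,G,H\n"
)
blocks = split_blocks(parse_grid(csv_data))
print(len(blocks))
```

Each recursive call works on a strictly smaller grid, so the recursion terminates once a block contains no qualifying run of empty rows or columns.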

How did you test it?

Added unit tests.

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2025-02-05T11:47:25Z

Pull Request Test Coverage Report for Build 13239732856

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.05%) to 92.299%

Totals
Change from base Build 13237729498:	0.05%
Covered Lines:	9288
Relevant Lines:	10063

💛 - Coveralls

alex-stoica · 2025-02-09T14:28:13Z

Hey @sjrl, I've tested your implementation a bit.

Performance:
When running the tests test_large_standard_case and test_large_side_by_side_case, I observed the following timings:

14.66s call     test/components/preprocessors/test_csv_document_splitter.py::TestCSVDocumentSplitterV2::test_large_side_by_side_case
14.48s call     test/components/preprocessors/test_csv_document_splitter.py::TestCSVDocumentSplitterV2::test_large_standard_case

On average, that is about a 2x speedup over the previous BFS approach. 💪 Good job!

Correctness:
Consider the test case test_complex_bridging:

    def test_complex_bridging(self) -> None:
        """
        Rows bridging from left to right => BFS splits each row into left block & right block.
        """
        csv_data = """ID,LeftVal,,,RightVal,Extra
1,Hello,,,World,Joined
2,StillLeft,,,StillRight,Bridge

A,B,,,C,D
E,F,,,G,H
"""
        splitter = CSVDocumentSplitterV2(row_split_threshold=1, column_split_threshold=1)
        result = splitter.run([Document(content=csv_data)])
        docs = result["documents"]
        assert len(docs) == 4
        block_texts = [doc.content for doc in docs]
        assert any("ID,LeftVal" in text for text in block_texts)
        assert any("Hello" in text for text in block_texts)
        assert any("World,Joined" in text for text in block_texts)
        assert any("StillLeft" in text for text in block_texts)
        assert any("StillRight,Bridge" in text for text in block_texts)
        assert any("A,B" in text for text in block_texts)
        assert any("C,D" in text for text in block_texts)
        assert any("E,F" in text for text in block_texts)
        assert any("G,H" in text for text in block_texts)

In this scenario the splitter should return 4 documents, but it returns 2. The cause is that skip_blank_lines in pandas defaults to True, so the blank line separating the upper and lower tables is dropped when the CSV is read. The same issue occurs in the test test_empty_rows_and_side_tables.
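
The effect of skip_blank_lines can be reproduced with pandas alone. The following is a minimal sketch, not the splitter's code; the sample data and variable names are made up for illustration. With the default setting the blank separator line disappears during parsing, while with skip_blank_lines=False it is kept as a row of missing values that can be detected afterwards.

```python
from io import StringIO

import pandas as pd

# Two small tables separated by one blank line (illustrative data).
csv_data = "A,B\n1,2\n\nC,D\n3,4\n"

# Default behaviour: skip_blank_lines=True drops the blank separator line.
df_default = pd.read_csv(StringIO(csv_data), header=None, dtype=object)

# Keeping blank lines: the separator survives as a row of missing values.
df_keep = pd.read_csv(
    StringIO(csv_data), header=None, dtype=object, skip_blank_lines=False
)

# Rows in which every cell is missing mark a boundary between tables.
empty_rows = df_keep.index[df_keep.isna().all(axis=1)].tolist()

print(len(df_default), len(df_keep), empty_rows)
```

Only the second frame still carries the information needed to tell the two tables apart by row.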

Usability
Overall, I find your implementation more intuitive and easier to understand.

sjrl · 2025-02-10T07:50:56Z

Thanks for taking a look @alex-stoica! I’ll make the changes you suggest and add the test cases for this.

haystack/components/preprocessors/csv_document_splitter.py

davidsbatista

LGTM - we should now also open an issue to keep track of the documentation for this new component.

sjrl · 2025-02-10T17:10:33Z

Opened issue here: #8835

CSV Document Splitter

25a94dd

github-actions bot added the topic:tests and type:documentation labels Feb 5, 2025

sjrl mentioned this pull request Feb 5, 2025

feat: Add CSV splitter with side-by-side BFS detection #8795

Closed

sjrl added 2 commits February 5, 2025 12:40

Add license header

d8f62c9

Add newline

92a1440

sjrl added 10 commits February 5, 2025 12:50

Add to docs

f8bdfdc

Add lineterminator

a5abe52

Merge branch 'main' of github.com:deepset-ai/haystack into recursive-csv-splitter

005f113

Updated csv splitter to allow user to specify to split by row, column or both

5d94f76

Adding more tests

cb9b766

Column tests

78f71b5

Some refactoring to remove incorrect dropna call

990062c

Fix

4ecaea6

More complicated test

1003904

Adding more relevant metadata to match whats provided in our other splitters

626b6ba

sjrl marked this pull request as ready for review February 7, 2025 13:30

sjrl requested review from a team as code owners February 7, 2025 13:30

sjrl requested review from dfokina and davidsbatista and removed request for a team February 7, 2025 13:30

sjrl added 4 commits February 7, 2025 14:37

value error tests

1c5368a

Fix mypy

18d6e40

Merge branch 'main' of github.com:deepset-ai/haystack into recursive-csv-splitter

22dff10

Docstring updates

34a7dc4

sjrl added 4 commits February 10, 2025 10:11

Merge branch 'main' of github.com:deepset-ai/haystack into recursive-csv-splitter

703158a

Add skip_blank_lines=False

e263898

Add to dict test

3d5df70

More from and to dict tests

8b5ad10

github-actions bot added the topic:core label Feb 10, 2025

Fixes

4d4ea34

sjrl removed the topic:core label Feb 10, 2025

Merge branch 'main' into recursive-csv-splitter

ca20f74

davidsbatista reviewed Feb 10, 2025

haystack/components/preprocessors/csv_document_splitter.py

Move dict creation outside of for loop

dc12957

davidsbatista approved these changes Feb 10, 2025

sjrl merged commit f9e6e48 into main Feb 10, 2025
19 checks passed

sjrl deleted the recursive-csv-splitter branch February 10, 2025 17:10

sjrl mentioned this pull request Feb 11, 2025

fix: Fix csv document splitter when table is only one row wide #8839

Merged
